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Small-angle X-ray scattering (SAXS) is an experimental technique that allows 
structural information on biomolecules in solution to be gathered. High-quality 
SAXS profiles have typically been obtained by manual merging of scattering 
profiles from different concentrations and exposure times. This procedure is 
very subjective and results vary from user to user. Up to now, no robust 
automatic procedure has been published to perform this step, preventing the 
application of SAXS to high-throughput projects. Here, SAXS Merge, a fully 
automated statistical method for merging SAXS profiles using Gaussian 
processes, is presented. This method requires only the buffer-subtracted SAXS 
profiles in a specific order. At the heart of its formulation is non-linear 
interpolation using Gaussian processes, which provides a statement of the 
problem that accounts for correlation in the data. 
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1 . Introduction 

Small-angle X-ray scattering (SAXS) is a popular experiment 
that allows low-resolution structural information on bio- 
molecules in solution to be gathered (Jacques & Trewhella, 
2010; Rambo & Tainer, 2010; Svergun, 2010). The SAXS 
experiment allows for a wide variety of solution conditions 
and a wide range of molecular sizes. Data collection usually 
takes between seconds and minutes in a synchrotron facility, 
or up to a few hours in an in-house X-ray source (Hura et al. , 

2009) . 

The SAXS profile of a biomolecule is the subtraction of the 
SAXS profile of the biomolecule in solution minus the SAXS 
profile of the matching buffer. SAXS can be used to study a 
wide variety of biomolecules, such as proteins, RNA or DNA, 
and their complexes (Lipfert et al, 2009; Rambo & Tainer, 

2010) , under a variety of experimental conditions. Once this 
profile is obtained, it can be used for a variety of modeling 
tasks (Jacques & Trewhella, 2010; Rambo & Tainer, 2010; 
Svergun, 2010; Schneidman-Duhovny et al, 2012). It is 
essential to perform the radial averaging and buffer subtrac- 
tion steps with high accuracy, as an error at that stage would 
propagate later on. 

The SAXS profile consists of a collection of momentum 
transfer values (scattering vector) q, mean intensities I(q) and 
standard deviations s(q). Data collection for a given sample is 



often repeated a number of times N to reduce the noise (or 
standard deviation) in the SAXS profile by averaging. We 
consider N as the number of points entering the calculation of 
/ and s, because the variation between repetitions is much 
greater than that due to radial averaging of a single experi- 
ment. Additionally, we suppose that the SAXS profiles were 
collected at several sample concentrations and X-ray exposure 
times. Both higher concentration and longer X-ray exposure 
times can provide more information at higher scattering 
angles. However, both conditions influence the resulting 
SAXS profile. At higher concentrations, particle-particle 
interactions can affect the slope of the very low angle part of 
the SAXS profile (Glatter & Kratky, 1982). At longer expo- 
sures, radiation damage can perturb any region of the SAXS 
profile (Kuwamoto et al, 2004). To remove these unwanted 
effects it is thus necessary to merge datasets from different 
experimental conditions. It is the purpose of this method to 
show that it is possible to do so automatically with minimal 
user manipulation. 

In this article we present the method behind the SAXS 
Merge webserver, a tool presented by Spill et al. (2014) which 
merges SAXS profiles in a robust, completely automated and 
statistically principled way. While the method was tested on 
SAXS datasets, it can also be applied for merging small-angle 
neutron scattering (SANS) datasets, because the basic equa- 
tions and methods are similar for the two techniques (Svergun, 
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2010). SAXS Merge consists of five steps: data clean-up, profile 
fitting using Gaussian processes, rescaling of each profile to a 
common reference, classification and detection of incompa- 
tible regions, and merging of the remaining data points. 
The resulting object is a probability distribution function 
describing the merged SAXS profile. Resulting data consist of 
the experimental points that are compatible with the distri- 
bution, a maximum posterior estimate of the SAXS profile 
across all experiments along with a credible interval, and 
estimates of a number of parameters such as the radius of 
gyration and the Porod exponent. 

2. Five steps for SAXS merging 

SAXS Merge consists of five sequential steps: (i) data clean-up, 
(ii) profile fitting using Gaussian processes, (iii) rescaling of 
each profile to a common reference, (iv) classification and 
detection of incompatible regions, and (v) merging of the 
remaining data points. The first three steps are performed 
separately on all input SAXS profiles. We now go through 
each of these five steps sequentially. 

2.1. Data clean-up 

In this step, we remove from input SAXS profiles data 
values for which the expected value is not significantly 
different from zero. Let H 0 be the null hypothesis of a data 
point being purely noise-induced. Let TL l be the alternative 
that it contains some signal. Then with a type-I error of a, we 
can perform a one-sample one-sided f-test. Let be a mean 
intensity at momentum transfer q t , s(q i ) the standard deviation 
and N the number of repetitions of the experiment. Then the t 
statistic is 



t = 



Had 



(1) 



and it has a Student t distribution with v = N — 1 degrees of 
freedom. Since we are performing a large number of tests, we 
apply the Bonferroni correction by defining a = a/M (M is the 
total number of points in the SAXS profile) and choose a = 
0.05 by default. Normality of the noise is assumed, which is 
reasonable if no parameter varies across the TV replicates of an 
experiment. 

Points with no or zero standard deviation are discarded. 
Optionally, points with much larger variances than average are 
discarded as well. This option is proposed because SAXS 
profiles have almost constant s(q i ) values, except at extreme 
values for q i in which case s(q t ) diverges. This behaviour is an 
experimental artefact, and it is reasonable to remove such 
points. We therefore calculate the median s and discard points 
which have s(q i ) > 20s. 

2.2. Profile fitting using Gaussian processes 

We have a number of measurements for a SAXS profile, 
summarized by three sufficient statistics: intensity /(<?;), stan- 
dard deviation s(q,) and number of repetitions N independent 
of The SAXS profile is modelled as the noisy measurement 



of an unknown smooth function q \— > X(q) at M different data 
points. A pointwise treatment of SAXS profiles fails because 
of the high noise and correlation encountered in the 
measurements. This pointwise treatment would lead to an 
inconsistent classification [step (iv), data not shown]. It is 
crucial to account for the correlation between successive 
points to be able to detect outlier data points in a robust 
manner. Thus, we first estimate the most probable SAXS 
profile, which was measured with noise in a given SAXS 
experiment. 

This functional estimation is achieved with the help of the 
theory of Gaussian processes. Gaussian process interpolation 
(GPI) is a form of non-parametric fitting which has a 
straightforward probabilistic interpretation and provides 
confidence intervals on the functional estimate. Given some 
data and an automatically adjusting smoothness penalty, GPI 
provides the most probable function that fits the data. For 
more information on Gaussian processes, see p. 535 of 
MacKay (2003), §13.43 of O'Hagan & Forster (2004), 
Rasmussen & Williams (2006) and http://gaussianprocess.org. 

2.2.1. Likelihood. Define 



I = 



S = diag 




J = 



\iiau), 



(2) 



S is the sample covariance matrix, assumed to be diagonal 
given X. We treat I as a measurement with noise of the 
function X at positions {qj\ i= \^ so that I = X + s where s is 
a vector distributed as a multivariate normal distribution with 
zero mean and covariance matrix £. We make the assumption 
that 2 = cr 2 S, where a is a proportionality constant that will be 
estimated in the process. The assumption of a diagonal S 
matrix is not entirely correct, as shown by Breidt et al. (2012). 
However, correlations are expected to be non-zero only 
between neighbouring annuli (i.e. q values), and the covar- 
iance model we introduce next spans much further than that. 
This assumption leads to the following likelihood, 

p(I\ X, S, N) = 



(2tt) m/2 \o 2 S/N\ 1/2 
x exp[-(l/2)(I-I) T (a 2 S//V)" 1 (I- J)]. (3) 



2.2.2. Prior. The likelihood alone does not constrain the 
vector X, which is still free to vary. However, we believe that 
the function X is smooth. This belief is modelled by assuming 
that the vector X follows a multivariate normal distribution 
with mean vector m and covariance matrix W which have been 
carefully chosen (see below), 



p(J|m, W) = 



1 



(2tt) 



M/2 



w 



1/2 



exp[-(l/2)(J - m) T W _1 (2 - m)] . (4) 
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Equivalently, one can say that the function 2" has a Gaussian 
process prior distribution with prior covariance function w and 
prior mean function m. 

2.2.3. Choice of w. The covariance function determines the 
smoothness of the Gaussian process. We choose the commonly 
used squared exponential form, which yields continuous and 
infinitely differentiable functions. Therefore, this approach is 
in principle only usable on smooth SAXS profiles, 



w{q,q') = T 2 exp 



(q-q'Y 



2k 2 



(5) 



The covariance function has two parameters: r 2 is the variance 
that the Gaussian process assigns in regions far from any data 
point; k is the persistence length of the profile, in units of q. 
With this covariance function, we define 



w(q) = 



\w{q M ,q)J 
f w(<7i,<7i) 



(6) 



h<<7i,<?m) \ 



2.2.4. Choice of m. Gaussian process interpolation is a non- 
parametric approach. However, it is possible to incorporate 
some prior knowledge in the form of a parametric mean 
function, making the approach semi-parametric. In our case, 
this way of proceeding has the advantage of providing an 
estimate of the radius of gyration and other constants. At the 
same time, the Gaussian process fits the signal in the data that 
is unexplained by the parametric mean function so that even 
high deviations from the prior mean function will be followed 
by the Gaussian process. 

At very low angle, the Guinier plot allows for an estimation 
of the radius of gyration, 



I(q) oc exp 



R 2 a 



q 



(7) 



For the higher-angle portion of the profile, Porod's law is 

I(q) oc q~\ (8) 

Hammouda (2010) constructed a smooth function encom- 
passing both behaviours, which we use as a starting point for m, 

mW - A+ \D/q« ifq> qi , W 



1/2 



qi = (l/R G )[(d-s)(3-s)/2] 
D = Gqt s exp[-^ 2 /(3 - s)\ 



(10) 
(11) 



This function has five parameters: A, G, R a , d and s. Some of 
them can be fixed to certain values, generating a nested family 
of parametric functions. For example, setting G = 0 reduces m 



to a constant function. Setting d such that q x is larger than any 
input g-value reduces m to the Guinier model with a constant 
offset. Finally, setting s = 0 reduces m to the simpler Guinier- 
Porod model described in the first section of Hammouda 
(2010) (up to a constant offset). Define 



= [m(<7 1 )---m( (?M )] T . 



(12) 



2.2.5. Hyperprior. The parameters arising in the prior mean 
or covariance functions as well as a are collectively called 
hyperparameters. In this hierarchical approach we can in turn 
assign a prior to these hyperparameters. Since our knowledge 
of their plausible values is rather vague, we give a Jeffreys 
prior to a 2 and a uniform prior to the other parameters. 
However, for the sake of model comparison, parameters are 
bounded within a finite interval to allow for a normalized 
prior, 

(a 2 ) = 1 1 

log(a max /cr min ) (j 2 

x (13) 

P( P .) = pmax _ pmin P, € {G, R Q ,d,5,A,T,k}. 

2.2.6. Fitting the SAXS profile. In order to find the best fit of 
the SAXS profile, it is required to optimize the hyperpara- 
meters. Defining 0 = (G, R G , d, s, A, r, k, a) T and D = 
(I, S, N), this optimization can be achieved by maximizing 
p(0\D) with respect to 0. With the help of Bayes' rule, we 
obtain 



p(&\D) oc p(I\S,N,&)p(&), 



(14) 



where p(&) is given in equation (13) and p(l\&, S, N) is called 
the marginal likelihood, 

p(I|S, N, 0) = fp(l\l, S, N)p(l\&) dJ. (15) 

Since both the likelihood [equation (3)] and the prior [equa- 
tion (4)] appearing in this integral are multivariate Gaussian 
distributions, it is possible to give an analytical expression of 
the marginal likelihood, 

p(l\S,N,@) = Jr, w expf -^e T n _1 e), (16) 

with £ = I - m and H = a 2 S/N + W. 

2.2.7. Obtaining functional deviates. To make predictions 
of I at a new point q we average over all possible values for 0, 
weighted by their posterior probability, 



p[l(q)\D] = fp[l(q)\&, D] P (0\D)d&. 



(17) 



Let us examine the two terms appearing in this last integral. 
The posterior probability density of the hyperparameters 
p(0\D) was already encountered in equation (14). 

The remaining term, p(X(q)\&, D), is the posterior predic- 
tive probability density of a new noise-free observation given 
the hyperparameters. It is called posterior predictive because 
it allows new values of the SAXS profile given the noisy 
observations to be predicted. Since the function 1 has a 
Gaussian process prior and a multivariate normal likelihood, 
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the posterior distribution for T is also a Gaussian process, with 
mean function X and covariance function <r| given by 



Vq l(q) = m{q) + ^{q) T Qr\\ - m), 



(18) 



Vq, q' al(q, q') = w{q, q') - v*(.q) T £T^(q') . (19) 

These equations arise from the fact that the vector 
[T(q), Iiq^), . .., I(q M )] has a multivariate normal distribu- 
tion with mean vector [m(q), m(q l ), . . . , m(q M )] T and covar- 
iance matrix 



w(q, q) w(q) T 

w(q) n 



(20) 



The distribution for T(q) then results from the conditioning of 
the multivariate normal distribution on the observed values, 



p[I(q)\&,D] 



1 



(27t) 1/2 a x (q,q) 



exp 



l(q)-l(q) 



&£(q, q) 



■ (21) 



Note that it is also possible to generate random functional 
deviates from the posterior predictive distribution. If k points 
are wanted for each functional estimate, one can draw them 
from the multivariate normal distribution with mean vector 
and covariance matrix built, respectively, from the posterior 
mean function 1 and the posterior covariance function crj at 
the values q l , . . . , q k . 

Although we could in principle perform the interpolation 
by numerically integrating equation (17) for every value of q 
needed, this approach would be costly in terms of computation 
power. In fact, two integrals would need to be computed 
numerically, equation (17) and also the normalization constant 
of equation (14), 



p(I|S,A0 = fp(l\&,S,N)p(@)d@. 



(22) 



Luckily, as Gibbs & MacKay (1997) have pointed out, a 
Laplace approximation of this last integral is a very good 
approximation because hyperparameters are usually quite 
peaked around their most probable value. This approach is 
known as a type-II maximum likelihood (ML-II), 



p(I\S,N)~p(l\0,S,N)p(&)A0, 



A0 = {2nf 



d 2 E 



(i,0) 



-1/2 



E(I, 0) = -logp(I|0, S, AO - logp(&). 



(23) 



(24) 



(25) 



N = dim(0) is the number of parameters. A0 is the phase 
space volume in which values of 0 are acceptable given D, and 
is usually small (Rasmussen & Williams, 2006). This procedure 
has a considerable practical advantage, since optimization of 
the hyperparameters then does not need to be performed for 
each new T(q) but only once for this dataset. The optimization 
itself has been described in §2.2.6. 

Once the most probable 0 has been found, the Laplace 
approximation gives the normalization constant of p(0\D), 



p(@\D) 



p(I|0,S,AQp(0) 
p(I\&,S,N)p(&)A0 



(26) 



With the additional hypothesis that p[I(q), 0, D]p(&\D) has 
the same maximum for 0 as p(0\D) alone, equation (17) 
becomes 



p[l( q )\D]~p[l(q)\®,D]\l n +AB 



-1 1 -1/2 



_ [ 9 2 log p[l(q)\0,D] \ 



B = 



90 z 



d 2 E(I, 0) 
90 2 



J 0 = 0 



(27) 



(28) 



(29) 



It is also possible to compute the posterior mean and covar- 
iance functions averaged over all values of 0, 



i(q)~i(q)\l n +A'B 



-11-1/2 



al(q,q)~al(q,q)\l n +A"B 



-11-1/2 



A' = 



90- 



-logJfo) 



0 = 0 



A" 



90- 



log <ri(q,q) 



(30) 



(31) 



(32) 



(33) 



0 = 0 



2.2.8. Choice between different mean functions via model 
comparison. Sometimes, the information content of a SAXS 
profile is so low that the number of parameters in the mean 
function exceed the number of identifiable parameters of 
the SAXS profile. In that case, overfitting occurs, and it is 
preferrable to try a simpler mean function. 

The previously presented mean function has five para- 
meters. It has been noted that it generates a nested family of 
parametric functions when some parameters are held fixed. 
For globular proteins, s can be set to zero, reducing the 
number of parameters to four. It is also possible to use simpler 
functions. For example, 



m G (q) = A + G exp 



(34) 



assumes the SAXS profile only contains the Guinier region; 
it has three parameters. The flat function has one parameter: 
m F (q) = A. 

Fitting can be performed using a number of different mean 
functions. The one that is the most plausible is then selected by 
model comparison. Suppose M 1 (M 2 ) represents the model in 
which the mean and covariance functions total A^ 1 (/V p 2 ) 
parameters, summarized in the parameter vector 0 1 (0 2 ). The 
best mean function is the one which has the highest Bayes 
factor, 
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pjM^D) = P (M 1 )p(D\M 1 ) 
p(M 2 \D) p(M 2 )p(D\M 2 ) 

fp{D\& l ,M l )p{& l \M l )d& l 
fp(D\& 2 ,M 2 )p(& 2 \M 2 )d& 2 



= 1 



(35) 
(36) 



The Bayes factor is the ratio of the evidences if both models 
have an equal a priori probability. As was just discussed, we 
simplify this assumption further by performing a Laplace 
approximation of the integral around the maximum posterior 
set of parameters. This expansion yields 



p(M 2 \D) 



p{\\&„ S, N , M 1 )p(0 1 |M 1 ) 



p(I|02.S,JV,W 2 )p(0 1 |M 2 ) 

52c . I" 1 / 2 , |o2 



x 



I 



30? 2 



-1/2 



(37) 



Details of the calculation, along with gradient and hessian 
of Hammouda's Generalized Guinier Porod function 
(Hammouda, 2010), are given in the supporting information. 1 

2.3. Rescaling 

Suppose X 0 is given, and we want to find the scaling factor y 
between X 0 and X 1; such that the distance between X 0 and yX l 
is minimized under some metric. We propose three similar 
models to rescale the SAXS profiles: normal model, normal 
model with constant offset, and lognormal model (using the 
logs of the intensities rather than the intensities themselves). 
In this section we assume X t are evaluated at M points, and 
treat X i as a vector with M entries. 

In the normal model we use the squared error loss 

C = (J 0 - KJ 1 ) T A(X 0 - yXJ, (38) 
where A is a symmetric positive definite matrix. The risk is 

U ee E Xi {E Io [(J 0 - yJ 1 ) T A(X 0 - yXJ] } . (39) 
It can be put in the form 

n=(l 0 - y^Y^o - yi t ) + tr[A(L 0 + y 2 ^)], (40) 



where E ; is the covariance matrix of X t . We would like to 
choose y and A so that the risk is minimal, 

SfJZ 
~dy~ 



-l(x o - yX^j A Jj + 2y trCAEO, 



(41) 



9"K 
9A 



= (Z 0 - y^) (j 0 - yJ 1 ) T + E 0 + y 2 ^. 



(42) 



The second equation is a sum of positive matrices, and cannot 
be zero. Therefore there is no choice of A that minimizes the 
risk. We choose A = Lj -1 . Minimizing the first equation gives 
the target value for y, 



ij^i 1 x 0 
xJ^% + m' 



(43) 



1 Supporting information for this paper is available from the lUCr electronic 
archives (Reference: CO5036). 



The mean vectors are computed from equation (18) or (30); 
the covariance matrices from equation (19) or (31). 
The normal model with offset has loss function 

C ee [J 0 - yil, + c J)] T Lr 1 [J 0 - y(l x + cJ)] , (44) 
where J is a vector of ones. This model leads to the estimates 

mx^j + i 1 E 1 - 1 (i 1 i 0 T - ioi^ij 



J^-^jT.JjT^lj 



XlT^\X x +cJ) + M 
Finally the lognormal model has loss function 



C = 



J logy 



sr 1 



J logy -log 



(45) 



(46) 



(47) 



which is defined because the intensities are expected to be 
positive. The estimate for y is then 



logy 



[log^o/J^jV'i 



J T ^J 



(48) 



By default, all profiles are rescaled to the last profile, which 
has usually the widest range. 

2.4. Classification 

To classify the SAXS profiles, it is necessary to rank them. 
SAXS profiles are ranked as follows. For each profile i, we 
compute 7,(0) by fitting the Guinier region, and the median s ; 
of the errors. We use the median instead of the mean because 
it is more robust to outliers. The profiles are then ranked by 
ascending 7,(0)/5,. This quantity is expected to increase with 
either concentration or X-ray dose. 

The first profile has the reference status on all intervals that 
have not been discarded by the first step (i.e. as long as its 
signal-to-noise ratio is sufficiently high). Let X be the candi- 
date profile, and X ref the reference profile, for which we have 
just derived a distribution in the fitting step. Because corre- 
lation has been accounted for in the profile fitting step (§2.2.6), 
pointwise statistical treatment is sufficient. The SAXS profiles 
are then compared by using a two-sample two-sided t test and 
regions of disagreement are determined. 

We would like to know which measurements of X(q) and 
X lef (q) are compatible. We simply assume that each new 
observation at scattering angle q is drawn from a normal 
distribution with mean /x(g) and standard deviation cr exp (g), 
where 



H(q) = yX(q), 
<4p(?) = 9 2 ^{q)- 



(49) 



(50) 



X(q) and a x (q) are given by equations (30) and (31) and y by 
equations (48), (46) or (43). If no parameter averaging was 
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performed, one can use T and a T instead of T and <j t given by 
equations (18) and (19), respectively. 

We then perform Welch's two-sample two-sided f-test at 
confidence level a (Welch, 1947). Similar to §2.1, we compute 
the t statistic 



t = 



[<rl p (q)/N+a^ tet (q)/N Kt ] 



1/2 



(51) 



with N and N lel the number of repetitions of each experiment. 
The degrees of freedom are given by the Satterthwaite 
approximation, 

v = [a e 2 xp ( g )/jV + a e 2 xp , ref (g)/jV ref ] 2 

[al p (q)/N] 2 /(N - 1) + [a e 2 xp>re£ (?,)/A' ref ] 2 /(^ r ef " 1) 

(52) 



If the p-value of this test is smaller than a then the functions 
are locally different and Z(g ; ) is discarded. 

Usually, the second profile spans a wider q range, so that 
comparison with the reference profile cannot be carried out at 
higher angles. In such a case the remaining portion of the 
second profile is marked as valid, and becomes the reference. 
Next, the third profile is compared with the low-angle part of 
the first profile and with the high-angle part of the second 
profile. If the third profile spans a wider q range than the 
second profile, its tail becomes the reference for the remaining 
q values, and so on until all SAXS profiles have been 
compared. 



2.5. Merging 

The merging step simply consists of pooling all compatible 
data points, keeping track of their origins. Gaussian process 
interpolation is then performed on this merged dataset. It can 
then happen that some datasets overlap, leading to multiple 
intensities for the same q values. In that case we discard the 
points which have the largest standard deviations. This beha- 
viour can be disabled. 



3. Conclusion 

In this article we have developed SAXS Merge, a fully auto- 
mated method for merging SAXS profiles in a robust and 
statistically principled way. It has been released as both a 
software package and a webserver, as described by Spill et 
al. (2014). The required input consists only of the buffer- 
subtracted profile files in a three-column format (q, intensity, 
standard deviation). 

YGS thanks Riccardo Pellarin for discussion about Baye- 
sian scoring. MN acknowledges funding from the European 
Union (FP7-IDEAS-ERC 294809). 
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